Abstract: We consider a request processing system composed of organizations and their servers connected by the Internet. The latency a user observes is a sum of communication
delays and the time needed to handle the request on a server. 
The handling time depends on the server congestion, i.e. the 
total number of requests a server must handle. We analyze 
the problem of balancing the load in a network of servers in 
order to minimize the total observed latency. We consider both 
cooperative and selfish organizations (each organization aiming 
to minimize the latency of the locally-produced requests). 
The problem can be generalized to task scheduling in a distributed cloud, or to content delivery in organizationally distributed CDNs.

In a cooperative network, we show that the problem is 
polynomially solvable. We also present a distributed algorithm 
iteratively balancing the load. We show how to estimate the 
distance between the current solution and the optimum based 
on the amount of load exchanged by the algorithm. During 
the experimental evaluation, we show that the distributed 
algorithm is efficient, therefore it can be used in networks 
with dynamically changing loads. 

In a network of selfish organizations, we prove that the 
price of anarchy (the worst-case loss of performance due to 
selfishness) is low when the network is homogeneous and the 
servers are loaded (the request handling time is high compared 
to the communication delay). After relaxing these assumptions, 
we assess the loss of performance caused by the selfishness 
experimentally, showing that it remains low. 

Our results indicate that a network of servers handling 
requests can be efficiently managed by a distributed algorithm. 
Additionally, even if the network is organizationally dis- 
tributed, with individual organizations optimizing performance 
of their requests, the network remains efficient. 

I. Introduction 

One of the most important aspects affecting the perceived 
quality of a web service is the delay in accessing the content. 
To avoid servers' congestion, the content of the web pages 
is commonly replicated in multiple locations. Additionally, 
in order to minimize the network latency, the replicas are 
placed close to users. Because the intensity of the web 
traffic changes dynamically, efficient mirroring requires both 
expensive infrastructure and effective load balancing algorithms. As a result, many organizations hand over the task of mirroring their data to dedicated platforms, content delivery networks (CDNs) [16], [27]. CDNs have been very successful in recent years: Akamai [24], [26], [31], the largest CDN, handles around 15-20% of the Internet traffic.



Consider an apparently different distributed system: a 
cloud of datacenters performing computationally-intensive 
parallel calculations. Each datacenter attempts to accelerate 
its calculations by distributing some of its load to less loaded 
and faster datacenters. However, the datacenters in remote 
locations must be avoided as the time needed to transfer the 
input and the result may dominate the processing time. 

Routing the requests in a CDN and distributing the load 
in a cloud are strongly related problems. In both there are 
systems of servers connected by the network (for simplicity, 
in cloud, we refer to a single datacenter as a "server", as 
we will not explore the parallelism inside a datacenter). The 
handling time on a server depends both on its performance 
metrics and its load. The final perceived latency comes from 
the network delays (required for transmitting the input data 
and the result) and from the handling time on the servers. 
Finally, in both cases every server has its initial load: in 
a CDN, the load is the current number of the data access 
requests to the server; in a cloud, the load is the number 
of initial tasks. We generalize these two problems to a load 
balancing of remote services. 

We assume that in the balanced system, the handling time 
of a single request on a server linearly depends on the total 
number of requests to be processed by the server. A linear 
dependency reflects a constant throughput of a server. In 
real systems, increasing the level of concurrency too much may overload the server and decrease its throughput (thrashing). However, assuming that the amount of work in the system as a whole is reasonable, there should be no overloaded servers in the balanced state. Similar assumptions are usually made in congestion games [12], [29] and in queuing theory, where a linear dependency is expressed by Little's law.

We assume that the transmission duration of a single 
request does not depend on the number of sent requests. 
Although some models (e.g., routing games [25]) consider the cases when the bandwidth of a link may become a bottleneck, we focus on a widespread network, in which there are
multiple cost-comparable routing paths between the servers. 
Thus, sending any data from one server to another should 
not significantly increase the network delay between them. 
These assumptions are also justified by our experiments: in the Appendix we discuss how the intensity of the network load generated between the servers influences the RTT between the servers in the PlanetLab environment. In other words, we consider that the load our system imposes on the network is negligible; thus, the network delay is caused only by the latency (resulting from, e.g., geographical distribution). Our problem formulation assumes the knowledge of such latencies; this is not a limitation because monitoring the pairwise latencies, which can change in time, is a well-studied problem with known solutions (e.g., see [9], [32]).
Optimizing latency is important for instance when streaming 
video files: a large latency delays the start, and, in case 
of communication problems, can be perceived as breaks in 
transmission. 

Balancing the servers' loads and finding the mirroring that minimizes network delays are both analyzed in the literature, but usually separately (see the related work section). Distributed systems
should consider both communication and computation. On 
one hand, clouds get geographically distributed, thus cannot 
ignore network latency. On the other, a CDN handling 
complex, dynamically-created content of the modern web, 
can no longer ignore the load imposed by the requests. 

For delivering large static content, like multimedia, some 
currently used techniques cache the content at specially 
designated front-end servers. A particular server is chosen by 
the round-robin algorithm. This approach is inefficient because, for instance, unpopular files are cached in multiple places. Benefits from optimizing request redirections can be significant. For instance, [10] proposes a caching scheme
that is both consistent (requests for the same content are 
redirected to the same front-end servers) and proportional 
(each server handles a desired proportion of requests). In 
this case, our algorithms can be viewed as a complementary 
optimization technique to caching - once the content must 
be downloaded from the back-end servers, we show how to 
efficiently distribute the download requests. 

In cloud computing, our model fits for instance processing 
streams of data in the real time or when the data stream is 
continuously produced and too large to be processed off- 
line. Consider a user interacting with a simulated virtual 
environment (e.g. pi): user's actions are captured by cam- 
eras; their image streams are analyzed in the real time to 
build a 3D model; then this model interacts with the virtual 
world model. Other applications include extracting statistics 
on users' actions in the Internet; or image analysis. 

In addition to a classic system with a central management, 
we analyze an organizationally-distributed system. Instead 
of a single, system-wide goal (minimize the overall request 
handling time), the organizations are selfishly interested only 
in optimizing the handling time of their local requests. This 
model reflects a CDN created as an agreement between 
e.g., many ISPs; or a federation of clouds, each having a 
different owner. Because typically the load changes dynam- 
ically, with peaks of demand followed by long periods of 
low activity, individual organizations are motivated to enter 
such a system: a peak can be offloaded, whereas handling foreign requests in the period of low activity is relatively inexpensive.

The lack of central coordination in the organizationally- 
distributed system increases the average processing time. 
The price of anarchy [21] expresses the worst-case relative 
increase in the latency in comparison with relinquishing the 
control to a centrally-managed organization (like Akamai's 
CDN). As the price of anarchy varies considerably between 
systems (from relatively small in congestion games to un- 
bounded in selfish replication p0[), we were curious to 
check it in our system. 

Our contribution is the following: (i) We show that the problem of network delay-aware load balancing can be stated as an optimization problem in the continuous domain; the problem is polynomially solvable, although standard solvers have $O(m^6)$ complexity. (ii) We propose a distributed algorithm that iteratively balances the servers' load towards the optimum. We confirm the algorithm's efficiency through simulation: even on a single CPU it outperforms the standard solvers. (iii) In a network of selfish servers, we prove that the price of anarchy is low ($1 + O(2cs/l_{av})$) if the communication delay between each pair of servers is the same and the request handling time on a server is significantly higher than the network delay. The experiments show that the loss of performance caused by the selfishness remains low (below 1.15) also without these assumptions.

II. Mathematical model 

Organizations, servers, tasks. The system consists of a set of $m$ organizations, each owning a server (or a cluster) connected to the Internet. The servers are uniform; each server $i$ has a constant processing speed $s_i$. The $i$-th organization has its own load consisting of a large number of small, individual tasks (or requests). The amount of load $n_i$ can be considered as the number of tasks at a particular moment (a snapshot) or, alternatively, as the steady-state rate of incoming requests in a system continuously processing requests. A task corresponds to, e.g., a unit-size computation in a computational cloud (a single work unit in a BOINC-type application, or a single invocation of a map-reduce function), or, in a CDN, a request for remote data coming from a user assigned to server $i$ (typically, a user would be assigned to the closest server). In the basic model we assume that the small tasks have the same sizes (this corresponds to the divisible computation load, or, in a CDN, to the case where the stored data chunks have constant sizes); thus the execution of a single request on the $i$-th server takes $1/s_i$ time units. In Section VII we show how to easily extend our results to tasks of different sizes.

Footnote 1 (on the complexity of standard solvers): We do not claim that no better centralized algorithm exists; however, due to the distributed nature of the problem, we are more interested in proposing a distributed algorithm than in tuning a centralized one.

Relaying tasks, communication delays. Each organization can relay some of its own requests to other servers. If the request is relayed, the observed handling time is increased by the communication latency on the link. We denote the communication latency between the $i$-th and the $j$-th server as $c_{ij}$ (with $c_{ii} = 0$). Since the communication delay of a single request does not depend on the amount of exchanged load (which is explained in Section I and confirmed by our experiments on PlanetLab, see the Appendix), $c_{ij}$ is just a constant instead of a function of the network load. We assume that the routing in the system is correct (optimized by the network layer). Thus, we will not consider optimizing communication time by relaying requests from $i$ to $j$ through a third server $k$ (if $c_{ik} + c_{kj} < c_{ij}$, the network layer would also discover the route $i \to k \to j$ and update the routing table accordingly, so that $c_{ij} := c_{ik} + c_{kj}$). We assume that each request can be sent to and executed on any server. However, if we set some of the communication delays to infinity, we restrict the basic model to the case when each organization is allowed to relay its requests only to a given subset of the servers (its neighbors), which models, e.g., a trust relationship.
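
The assumption that the network layer already uses the best route means that the latency matrix satisfies the triangle inequality. As a minimal sketch, assuming measured latencies are given as a dense, finite matrix (the function name `close_latencies` is hypothetical, and the Floyd-Warshall closure is our illustration rather than part of the model), raw measurements could be brought into this form before any trust restrictions (infinite entries) are applied:

```python
def close_latencies(c):
    """Return a copy of latency matrix c in which every entry is the
    cheapest route between the two servers (Floyd-Warshall closure),
    so that d[i][j] <= d[i][k] + d[k][j] holds for all i, j, k."""
    m = len(c)
    d = [row[:] for row in c]      # do not modify the caller's matrix
    for i in range(m):
        d[i][i] = 0                # c_ii = 0 by definition
    for k in range(m):             # allow server k as an intermediate hop
        for i in range(m):
            for j in range(m):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


raw = [[0, 10, 50],
       [10, 0, 15],
       [50, 15, 0]]
closed = close_latencies(raw)
print(closed[0][2])  # 25: the route 0 -> 1 -> 2 beats the direct link
```

The closure runs in $O(m^3)$ time, which is acceptable here because it is applied once per latency update rather than per request.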

Relay fractions, current loads. We use a fractional model in which a relay fraction $p_{ij}$ denotes the fraction of the $i$-th organization's own requests that are sent (relayed) to be executed at the $j$-th server ($\forall_{i,j}\; p_{ij} \ge 0$ and $\forall_i \sum_{j=1}^{m} p_{ij} = 1$). The load balancing problem is to find the appropriate values of the relay fractions (formalized later in this section). Once the fractions are known, each organization knows how many of its own requests it should send to each server; the tasks are sent and executed at the appropriate servers. The fractional model might be considered as a relaxation of the problem of handling non-divisible requests; in Section VII we show how to round the solution of the fractional model to a discrete model. Moreover, the fractional model itself fits the divisible load model used in scheduling theory. For clarity of presentation we use additional notation: $r_{ij}$ is the number of requests redirected from server $i$ to $j$ (thus $r_{ij} = n_i p_{ij}$), and $l_i$ is the current load of server $i$, i.e. the number of requests relayed to $i$ by all organizations, including the organization owning the server itself (thus $l_i = \sum_{j=1}^{m} r_{ji}$).

Completion times. We do not assume any particular order of requests executed on a server. First, since the number of requests is large, considering any particular order on the servers would increase the computational complexity. Second, in a continuously running system, we have no control over the order in which requests are produced (especially as they can also be delayed by the network); the usual FIFO policy results in an arbitrary order. Thus, for each of the $l_j$ requests that are actually processed on the $j$-th server, the expected processing time is equal to $l_j / 2s_j$ (constant omitted for clarity).

Since the $i$-th organization relayed $r_{ij}$ requests to $j$, the expected total completion time of requests relayed by $i$ to $j$ is equal to $r_{ij}(l_j/2s_j + c_{ij})$. The expected total processing time $C_i$ of the $i$-th organization's own requests is a sum, over all servers $j$, of the expected total completion times of the requests owned by $i$ and relayed to $j$:

$$C_i = \sum_{j=1}^{m} r_{ij}\left(\frac{l_j}{2s_j} + c_{ij}\right) = \sum_{j=1}^{m} n_i p_{ij}\left(\frac{\sum_{k=1}^{m} n_k p_{kj}}{2s_j} + c_{ij}\right) \tag{1}$$

We consider the expected (or the average) processing time, rather than the makespan of an organization, for several reasons. The average processing time is similar to the widely-used sum of processing times criterion ($\sum C_i$). We assume that the workload of each organization is created by many users; $\sum C_i$ models users' performance better than the makespan [15]. In all the contexts motivating our work from Section I (e.g., processing streams of data in real time, delivering content to the users) we are focused on the average user performance. Also, while $C_i$ depends on the vector $p = [p_{ij}]$ quadratically, the relation between the makespan and $p$ is just linear, which makes the problem considerably easier. Thus, we believe that some of our results could be adapted to applications, other than those pointed out above, that require optimizing the makespan.

The total processing time of all the requests in the system is denoted as $\sum C_i = \sum_{i=1}^{m} C_i$.
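
The definitions of $r_{ij}$, $l_j$, $C_i$ and $\sum C_i$ translate directly into code. The following is a minimal sketch, assuming the relay fractions are stored as a nested list `p[i][j]` and using 0-based server indices; the function names are hypothetical and chosen only for this illustration:

```python
def loads(n, p):
    """l_j = sum_i n_i * p_ij: number of requests executed on server j."""
    m = len(n)
    return [sum(n[i] * p[i][j] for i in range(m)) for j in range(m)]


def org_cost(i, n, s, c, p):
    """C_i = sum_j r_ij * (l_j / (2 s_j) + c_ij), with r_ij = n_i * p_ij."""
    l = loads(n, p)
    return sum(n[i] * p[i][j] * (l[j] / (2 * s[j]) + c[i][j])
               for j in range(len(n)))


def total_cost(n, s, c, p):
    """Sum of C_i over all organizations."""
    return sum(org_cost(i, n, s, c, p) for i in range(len(n)))


n = [100.0, 0.0]                      # only organization 0 has requests
s = [1.0, 1.0]                        # equal speeds
c = [[0.0, 10.0], [10.0, 0.0]]        # symmetric latency of 10
keep_all = [[1.0, 0.0], [0.0, 1.0]]   # nothing is relayed
split = [[0.5, 0.5], [0.0, 1.0]]      # organization 0 relays half
print(total_cost(n, s, c, keep_all))  # 5000.0
print(total_cost(n, s, c, split))     # 3000.0
```

In this toy instance relaying half of the load pays the latency of 10 on 50 requests but halves the congestion on server 0, so the total cost drops.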



Problem formulation. We consider two related problems. First, the goal is to find a vector of fractions $p$ such that the total processing time of the requests, $\sum C_i$, is minimized. This goal corresponds to a centrally-managed system having a unique owner and a single goal.

Second, we analyze the case when servers are a common good, but each organization selfishly minimizes the processing time of its own requests. The $i$-th organization is responsible for sending its own requests to appropriate servers. In other words, the $i$-th organization adjusts the values of $p_{ij}$ in order to minimize $C_i$. This approach is similar to the selfish job model [34], in which jobs selfishly choose processors to minimize their execution time. Similar agreements exist in real-life systems: e.g., PlanetLab servers are treated as a common good managed by a central entity; PlanetLab users choose the servers they want to use for their experiments. Also, in academic grids (e.g. Grid5000 in France), participating organizations grant control over their resources to a central entity; in return, users can submit their jobs to any resource. In this case, we look for a vector of fractions $p$ for which the system reaches the Nash equilibrium. By comparing the resulting $\sum C_i$ with the result for the centrally-managed system, we will find the price of anarchy, quantifying the effect of selfishness on the total processing time.

Figure 1: Matrix Q: X denotes non-zero values

III. Optimal solution 

In this section we assume that there is a central processing unit that has complete knowledge about the whole system. Given the communication latencies $c_{ij}$ and the organizations' own loads $n_i$, our goal is to find an algorithm setting relay fractions $p_{ij}$ so that the total processing time of all the requests, $\sum C_i$, is minimized. We express the problem as a quadratic programming problem. We show that the problem is polynomially solvable.

We express the total processing time $\sum C_i$ in a matrix form as $\sum C_i = p^T Q p + b^T p$, where:

• $p$ is a vector of relay fractions with $m \cdot m$ elements. $p_{(i,j)}$, the element at the $(i \cdot m + j)$-th position, denotes the fraction of local requests of the $i$-th server that are relayed to the $j$-th server, $p_{ij}$; thus:

$$p = [p_{(1,1)}, p_{(1,2)}, \ldots, p_{(1,m)}, p_{(2,1)}, \ldots, p_{(m,m)}]^T$$

• $Q$ is an $m^2$-by-$m^2$ matrix in which $q_{(i,j),(k,l)}$ denotes the element in the $(i \cdot m + j)$-th row and the $(k \cdot m + l)$-th column:

$$q_{(i,j),(k,l)} = \begin{cases} n_i n_k / s_j & \text{if } j = l \text{ and } i < k; \\ n_i n_k / 2s_j & \text{if } j = l \text{ and } i = k; \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Figure 1 presents the structure of matrix $Q$.

• $b$ is a vector with $m^2$ elements, with $b_{(i,j)}$ denoting the element at the $(i \cdot m + j)$-th position: $b_{(i,j)} = c_{ij} n_i$.

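The following derivation shows how the matrix $Q$ is constructed:

$$p^T Q p = \sum_{i,j} p_{(i,j)} \sum_{k \ge i} q_{(i,j),(k,j)}\, p_{(k,j)} \tag{3}$$

$$= \sum_{i,j} p_{(i,j)} \left( \frac{n_i^2}{2s_j}\, p_{(i,j)} + \sum_{k > i} \frac{n_i n_k}{s_j}\, p_{(k,j)} \right) \tag{4}$$

$$= \sum_{i,j} n_i p_{(i,j)} \frac{\sum_k n_k p_{(k,j)}}{2s_j} = \sum_{i,j} r_{ij} \frac{l_j}{2s_j} \tag{5}$$

(3) follows from the construction of the matrix $Q$ (only the elements with $k \ge i$ and $l = j$ are non-zero). (4) substitutes $q_{(i,j),(k,j)}$ with the values defined in (2). (5) uses the commutativity of multiplication and substitutes $l_j = \sum_k n_k p_{(k,j)}$ and $r_{ij} = n_i p_{(i,j)}$.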
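
The matrix form can be checked numerically against the direct definition of $\sum C_i$. Below is a minimal sketch, assuming 0-based indices (so the position of $p_{(i,j)}$ is `i * m + j`) and a flat list for $p$; the helper names `build_qp`, `quadratic_cost` and `direct_cost` are hypothetical and introduced only for this check:

```python
import math


def build_qp(n, s, c):
    """Build the upper-triangular Q and the vector b of the matrix form."""
    m = len(n)
    size = m * m
    Q = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for i in range(m):
        for j in range(m):
            b[i * m + j] = c[i][j] * n[i]
            # diagonal entry (i = k, j = l)
            Q[i * m + j][i * m + j] = n[i] * n[i] / (2 * s[j])
            # entries above the diagonal (i < k, same target server j)
            for k in range(i + 1, m):
                Q[i * m + j][k * m + j] = n[i] * n[k] / s[j]
    return Q, b


def quadratic_cost(Q, b, p):
    """Evaluate p^T Q p + b^T p for a flat vector p."""
    size = len(p)
    quad = sum(p[r] * Q[r][col] * p[col]
               for r in range(size) for col in range(size))
    return quad + sum(b[r] * p[r] for r in range(size))


def direct_cost(n, s, c, p):
    """Evaluate sum_i sum_j n_i p_ij (l_j / (2 s_j) + c_ij) directly."""
    m = len(n)
    load = [sum(n[k] * p[k * m + j] for k in range(m)) for j in range(m)]
    return sum(n[i] * p[i * m + j] * (load[j] / (2 * s[j]) + c[i][j])
               for i in range(m) for j in range(m))


n, s = [100.0, 50.0], [1.0, 2.0]
c = [[0.0, 10.0], [10.0, 0.0]]
p = [0.5, 0.5, 0.25, 0.75]            # rows of the relay-fraction matrix
Q, b = build_qp(n, s, c)
print(math.isclose(quadratic_cost(Q, b, p), direct_cost(n, s, c, p)))  # True
```

The dense representation is used only for readability; since each row of $Q$ has at most $m$ non-zero entries, a sparse structure would be the natural choice for larger $m$.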

The constraints that $p_{ij}$ are fractions ($\forall_{i,j}\; p_{ij} \ge 0$ and $\forall_i \sum_{j=1}^{m} p_{ij} = 1$) can also be expressed in the matrix form.



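Algorithm 1: calcBestTransfer(i, j)

input: $(i, j)$ - the identifiers of the two servers
Data: $\forall_k\, r_{ki}$ - initialized to the number of requests owned by $k$ and relayed to $i$ ($\forall_k\, r_{kj}$ is defined analogously)
Result: The new values of $r_{ki}$ and $r_{kj}$
foreach $k$ do
    $r_{ki} \leftarrow r_{ki} + r_{kj}$; $r_{kj} \leftarrow 0$;
end
$l_i \leftarrow l_i + l_j$; $l_j \leftarrow 0$;   (the loads follow from $l_i = \sum_k r_{ki}$)
servers $\leftarrow$ sort $[k]$ so that $c_{kj} - c_{ki} < c_{k'j} - c_{k'i} \Rightarrow k$ is before $k'$;
foreach $k \in$ servers do
    $\Delta r_{ikj} \leftarrow \min\left(r_{ki}, \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{kj} - c_{ki})}{s_i + s_j}\right)$;
    if $\Delta r_{ikj} > 0$ then
        $r_{ki} \leftarrow r_{ki} - \Delta r_{ikj}$; $r_{kj} \leftarrow r_{kj} + \Delta r_{ikj}$;
        $l_i \leftarrow l_i - \Delta r_{ikj}$; $l_j \leftarrow l_j + \Delta r_{ikj}$;
    end
end
return for each $k$: $r_{ki}$ and $r_{kj}$;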



Algorithm 2: Min-Error (MinE) algorithm performed by server id.

Notation: impr(i, j) - this function calculates the improvement of $\sum C_i$ when transferring requests between $i$ and $j$. The number of requests that should be transferred can be computed by calcBestTransfer(i, j).

partner $\leftarrow \arg\max_j$ (impr(id, j));
relay(id, partner, calcBestTransfer(id, partner));



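First, $p \ge 0_{m^2}$, where $0_{m^2}$ is a vector of length $m^2$ consisting of zeros. Second, $Ap = 1_m$, where $1_m$ is a vector of length $m$ consisting of ones, and $A$ is an $m$-by-$m^2$ matrix defined by the following equation:

$$a_{i,j} = \begin{cases} 1 & \text{if } i m \le j < (i+1) m; \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

That is, row $i$ of $A$ has ones exactly at the positions of the fractions $p_{(i,\cdot)}$ of the $i$-th organization, so that row $i$ of $Ap = 1_m$ states $\sum_j p_{ij} = 1$.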

Minimization of $\sum C_i(p) = p^T Q p + b^T p$ with constraints $p \ge 0_{m^2}$ and $Ap = 1_m$ is an instance of the quadratic programming problem. As an upper triangular matrix, $Q$ has $m^2$ eigenvalues equal to the values on the diagonal: $n_i^2/2s_j$ ($1 \le i, j \le m$). All eigenvalues are positive, so $Q$ is positive-definite. Thus, the problem can be solved by the ellipsoid method in polynomial time [22]. According to [20], the best running time reported for solving a quadratic programming problem with linear constraints is $O(n^3 L)$ [19], where $L$ represents the total length of the input coefficients and $n$ the number of variables (here $n = m^2$), so the complexity of the best solution is $O(Lm^6)$.

IV. Distributed algorithm 

The centralized algorithm requires the information about the whole network: the size of the input data is $O(m^2)$ and the $Q$ matrix has $O(m^3)$ non-zero entries. A centralized algorithm thus has the following drawbacks: (i) collecting information about the whole network is time-consuming; moreover, loads and latencies may frequently change; (ii) a standard solver takes significant time (recall $O(Lm^6)$ in Section III); (iii) the central algorithm is more vulnerable to failures. Motivated by these limitations, we introduce a distributed algorithm for finding the optimal solution.

The distributed algorithm requires that each server has up-to-date information about the loads on the other servers and about the communication delays from itself to the other servers (and not for all pairs of servers). Thus, for each server, the size of the input data is $O(m)$. As indicated in Section I, the problem of monitoring the latencies is well-studied. The loads can be disseminated by a gossiping algorithm. As gossiping algorithms have logarithmic convergence time, if the gossiping is executed about $O(\log(m))$ times more frequently than our algorithm, each server has accurate information about the loads.

Each organization $i$ keeps, for each server $k$, the information about the number of requests that were relayed to $i$ by $k$. The algorithm iteratively improves the solution: in each step the $i$-th server communicates with the locally optimal partner server $j$ (Algorithm 2). The pair locally optimizes the current solution by adjusting, for each $k$, $r_{ki}$ and $r_{kj}$ (Algorithm 1). In the first loop of Algorithm 1, one of the servers, $i$, takes all the requests that were previously assigned to $i$ and to $j$. Next, all the organizations $[k]$ are sorted in ascending order of $(c_{kj} - c_{ki})$. The lower the value of $(c_{kj} - c_{ki})$, the more profitable it is to run requests of $k$ on $j$ rather than on $i$ in terms of the network topology. Then, for each $k$, the loads are balanced between servers $i$ and $j$.



In Section III we have shown that the optimization problem is convex. Thus, it is natural to try local optimization techniques. The presented mechanism requires only two servers to be involved in each optimization step, thus it is very robust to failures. This mechanism is similar in spirit to diffusive load balancing (see, e.g., [6]); however, there are substantial differences related to the fact that the machines are geographically distributed: (i) In each step no real requests are transferred between the servers; this process can be viewed as a simulation run to calculate the relay fractions $p_{ij}$. Once the fractions are calculated, the requests are transferred and executed at the appropriate server. (ii) Each pair of servers exchanges not only its own requests but also the requests of all servers that relayed their requests either to $i$ or to $j$. Since different servers may have different communication delays to $i$ and $j$, the local balancing requires more care (Algorithms 1 and 2).
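
The pairwise balancing and the partner selection described in this section can be sketched as follows. This is a minimal, single-process illustration under our own assumptions: all latencies are finite, there are at least two servers, `r[k][x]` holds the number of requests owned by `k` and executed on `x`, and the function names are hypothetical. It simulates the computation of the new assignment only; no real requests are moved.

```python
def calc_best_transfer(i, j, r, s, c):
    """Rebalance servers i and j; return new columns (r_ki, r_kj) for all k."""
    m = len(r)
    ri = [r[k][i] + r[k][j] for k in range(m)]  # first loop: i takes everything
    rj = [0.0] * m
    li, lj = sum(ri), 0.0
    # organizations for which j is relatively closer than i go first
    for k in sorted(range(m), key=lambda k: c[k][j] - c[k][i]):
        ideal = ((s[j] * li - s[i] * lj - s[i] * s[j] * (c[k][j] - c[k][i]))
                 / (s[i] + s[j]))
        delta = min(ri[k], ideal)
        if delta > 0:
            ri[k] -= delta
            rj[k] += delta
            li -= delta
            lj += delta
    return ri, rj


def pair_cost(i, j, ri, rj, s, c):
    """Part of the total processing time that depends on servers i and j."""
    li, lj = sum(ri), sum(rj)
    comm = sum(ri[k] * c[k][i] + rj[k] * c[k][j] for k in range(len(ri)))
    return li * li / (2 * s[i]) + lj * lj / (2 * s[j]) + comm


def min_e_step(i, r, s, c):
    """One MinE step of server i: balance with the most profitable partner.
    Updates r in place and returns the achieved improvement."""
    m = len(r)
    best = None
    for j in range(m):
        if j == i:
            continue
        old = pair_cost(i, j, [r[k][i] for k in range(m)],
                        [r[k][j] for k in range(m)], s, c)
        ri, rj = calc_best_transfer(i, j, r, s, c)
        gain = old - pair_cost(i, j, ri, rj, s, c)
        if best is None or gain > best[0]:
            best = (gain, j, ri, rj)
    gain, j, ri, rj = best
    if gain > 0:
        for k in range(m):
            r[k][i], r[k][j] = ri[k], rj[k]
    return gain


r = [[100.0, 0.0], [0.0, 0.0]]      # organization 0 holds all 100 requests
s = [1.0, 1.0]
c = [[0.0, 10.0], [10.0, 0.0]]
print(min_e_step(0, r, s, c))       # 2025.0
print(r[0])                         # [55.0, 45.0]
```

Evaluating the improvement by tentatively running the pairwise balancing for every candidate partner costs $O(m^2 \log m)$ per step in this sketch; it is chosen for clarity, not efficiency.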

A. Correctness 

The following lemma shows how to optimally exchange the requests owned by organization $k$ between a pair of servers $i$ and $j$.

Lemma 1. Consider two servers $i$ and $j$ that execute $r_{ki}$ and $r_{kj}$ requests of the $k$-th organization. The total processing time $\sum C_i$ is minimized when the $k$-th server relays $\Delta r_{ikj}$ out of its $r_{ki}$ requests to be additionally executed on the $j$-th server:

$$\Delta r'_{ikj} = \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{kj} - c_{ki})}{s_i + s_j}$$

$$\Delta r_{ikj} = \max(0, \min(r_{ki}, \Delta r'_{ikj}))$$

Proof: If the $k$-th server moves some of its requests from $i$ to $j$, then it affects the completion time of all requests that were relayed either to $i$ or to $j$ (initial requests of all servers). Recall that $l_i$ and $l_j$ are the loads of servers $i$ and $j$, respectively; that is, they include all tasks relayed to $i$ and $j$. Thus, if $k$ removes $\Delta r$ of its requests from $i$, then the new processing time of all tasks on server $i$ will be $(l_i - \Delta r)^2 / 2s_i$. Thus, we want to find $\Delta r_{ikj}$ that minimizes the function $f$:

$$f(\Delta r) = \frac{(l_i - \Delta r)^2}{2s_i} + \frac{(l_j + \Delta r)^2}{2s_j} - \Delta r\, c_{ki} + \Delta r\, c_{kj}$$

We can find the minimum by setting the derivative to zero:

$$\frac{df}{d\Delta r} = \frac{\Delta r - l_i}{s_i} + \frac{\Delta r + l_j}{s_j} - c_{ki} + c_{kj} = 0 \;\Leftrightarrow\; \Delta r = \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{kj} - c_{ki})}{s_i + s_j}$$

Since $f$ is convex and $\Delta r \in [0, r_{ki}]$, clamping this value to the feasible interval proves the thesis. ■
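
As a quick sanity check of Lemma 1, the formula can be evaluated on a small instance. The values below are assumed purely for illustration (two servers of equal speed, all load initially on $i$); they do not come from the experiments.

```latex
% Assumed illustrative values: s_i = s_j = 1, l_i = 100, l_j = 0,
% c_{ki} = 0, c_{kj} = 10, r_{ki} = 100.
\Delta r'_{ikj} = \frac{(1 \cdot 100 - 1 \cdot 0) - 1 \cdot 1 \cdot (10 - 0)}{1 + 1} = 45,
\qquad
\Delta r_{ikj} = \max(0, \min(100, 45)) = 45 .

% Comparing f at the optimum with two other feasible transfers:
f(0)  = \frac{100^2}{2} = 5000, \quad
f(45) = \frac{55^2}{2} + \frac{45^2}{2} + 45 \cdot 10 = 2975, \quad
f(50) = \frac{50^2}{2} + \frac{50^2}{2} + 50 \cdot 10 = 3000 .
```

The optimum moves slightly less than half of the load: equalizing the two servers exactly (a transfer of 50) would pay more in latency than it saves in congestion.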
The following lemma proves the correctness of Algorithm 1.

Lemma 2. After execution of Algorithm 1 for the pair of servers $i$ and $j$, it is not possible to improve $\sum C_i$ by exchanging any requests between $i$ and $j$.

Sketch of Proof: First we show that after the second loop no requests should be transferred from $i$ to $j$. For each organization $k$ the requests owned by $k$ were transferred from $i$ to $j$ in some iteration of the second loop; also, each of the next iterations of the second loop could only cause an increase of the load of $j$ (and a decrease of the load of $i$); thus transferring more requests of $k$ from $i$ to $j$ would be inefficient. Second, we show that after the second loop no requests should be transferred back from $j$ to $i$ either. Let us take the last iteration of the second loop in which the requests of some organization $k$ were transferred from $i$ to $j$. After this transfer we know that $\Delta r'_{ikj} = \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{kj} - c_{ki})}{s_i + s_j} \ge 0$ (otherwise the transfer would not be optimal). However, this implies that $\Delta r'_{ik'j} = \frac{(s_j l_i - s_i l_j) - s_i s_j (c_{k'j} - c_{k'i})}{s_i + s_j} \ge 0$ for each server $k'$ considered before $k$. As $\Delta r'_{ik'j} \ge 0$, we get $\Delta r'_{jk'i} \le 0$. □



B. Error estimation 

The following analysis bounds the distance of the current solution of the distributed algorithm to the optimum as a function of the disparity of the servers' loads. When running the algorithm, this result can be used to assess whether it is still profitable to continue: if the load disparity is low, the current solution is close to the optimum.

We introduce the following notation for the analysis. $p'$ is the snapshot (the current solution) derived by the distributed algorithm. $p$ is the optimal solution that minimizes $\sum C_i$ (if there are multiple optimal solutions with the same $\sum C_i$, $p$ is the solution closest to $p'$ in the Manhattan metric). $(P, \Delta p)$ is a weighted, directed error graph: $\Delta p[i][j]$ indicates the number of requests that should be transferred from server $i$ to $j$ in order to reach $p$ from $p'$ (the $\Delta p$ requests belong either to $i$ or to $j$, and not to another server $k$). We define dir as the direction of transport: $\mathrm{dir}(i, j) = 1$ if $i$ transfers to $j$ its own requests; $\mathrm{dir}(i, j) = -1$ if $i$ returns to $j$ the requests that initially belonged to $j$. Let $\mathrm{succ}(i)$ denote the set of successors in the error graph: $\mathrm{succ}(i) = \{j : \Delta p[i][j] > 0\}$; $\mathrm{prec}(i)$ denotes the set of predecessors: $\mathrm{prec}(i) = \{j : \Delta p[j][i] > 0\}$.

In the error graph, a negative cycle is a sequence of servers $i_1, i_2, \ldots, i_n$ such that (i) $i_1 = i_n$; (ii) $\forall_{j \in \{1, \ldots, n-1\}}\; \Delta p[i_j][i_{j+1}] > 0$; and (iii) $\sum_{j=1}^{n-1} \mathrm{dir}(i_j, i_{j+1})\, c_{i_j i_{j+1}} < 0$.

A negative cycle is a sequence of servers that essentially redirect their requests to one another. A solution without negative cycles has a smaller processing time: after dismantling a negative cycle, loads on servers remain the same, but the communication time is reduced. In the Appendix, we show how to detect and remove negative cycles; in order to simplify the presentation of the subsequent analysis, we assume that there are no negative cycles.

Proposition 1. If (i) the error graph $\Delta p$ has no negative cycle; and (ii) $\sum_j \max_k \left(\left(\frac{1}{s_j} + \frac{1}{s_k}\right) \Delta r_{jk}\right) = \Delta R$ ($\Delta r_{ij}$ is the number of requests which in the current state $p'$ would be relayed to the $j$-th server by the $i$-th server as the result of Algorithm 1), then $\|p - p'\|_1 \le (4m+1)\, \Delta R \sum_i s_i$, where $\|\cdot\|_1$ denotes the Manhattan metric.

Proof: First we show that there is no cycle in the error graph. By contradiction, let us assume that there is a cycle $i_1, \ldots, i_{n-1}, i_n$ (with $i_1 = i_n$). Because the error graph has no negative cycle, we have $\sum_{j=1}^{n-1} \mathrm{dir}(i_j, i_{j+1})\, c_{i_j i_{j+1}} \ge 0$. Now, if we reduce the number of requests sent on each edge of the cycle:

$$\Delta p[i_j][i_{j+1}] := \Delta p[i_j][i_{j+1}] - \min_{k \in \{1, \ldots, n-1\}} \left(\Delta p[i_k][i_{k+1}]\right)$$

then the load of the servers $i_j$, $j \in \{1, \ldots, n-1\}$, will not change (each server both receives and sends $\min_{k \in \{1, \ldots, n-1\}}(\Delta p[i_k][i_{k+1}])$ requests less). Additionally, the transfers in the network are reduced. Thus, we get a new optimal solution which is closer to $p'$ in the Manhattan metric, which contradicts the choice of $p$ as the optimal solution closest to $p'$.

Second, we show how to bound the distance to the optimum solution on a server by the transfers and loads on its neighbors. Let $l_i$ be the load of the $i$-th server in the optimal solution $p$. Let $l'_i$ be the load of the $i$-th server in state $p'$.

Consider a server $i$ for which $l_i < l'_i$. In state $p'$, in order to balance $i$ and $j \in \mathrm{succ}(i)$, at most $\Delta r_{ij}$ requests must be transferred (Lemma 2). For each $k$, $\Delta r_{ikj}$ depends on the difference between weighted loads $l'_i s_j - l'_j s_i$ (see the thesis of Lemma 1). Thus, by Lemma 2, in the current state $i$ and $j$ would be balanced if the difference in weighted loads is at most $D = (l'_i - \Delta r_{ij}) s_j - (l'_j + \Delta r_{ij}) s_i$. In the optimal state, the weighted load of $j$ is at least $s_i (l'_j - \max(l'_j - l_j, 0))$. In the optimal state, $j$ must also be balanced with $i$; thus, the difference in weighted loads is at most $D$. By solving for the reduction of load on $i$, we get that $l'_i$ is decreased by at most $\frac{s_i + s_j}{s_j} \Delta r_{ij} + \max\left(\frac{s_i}{s_j}(l'_j - l_j), 0\right)$. In the optimal solution, all the pairs of servers are balanced; thus, the difference between the current and the optimal load can be bounded by:

$$\frac{1}{s_i}(l'_i - l_i) \le \max_{j \in \mathrm{succ}(i)} \left( \left(\frac{1}{s_i} + \frac{1}{s_j}\right) \Delta r_{ij} + \frac{1}{s_j} \max(l'_j - l_j, 0) \right) \tag{7}$$

Eq. (7) also holds for any server $j \in \mathrm{succ}(i)$ for which $l_j < l'_j$. It can be recursively expanded until reaching the servers without successors in the error graph (there are no cycles in the graph). The resulting expanded equation takes into account the path constructed by the maximum imbalance ($\max_{j \in \mathrm{succ}(i)}$); the cost of this path is bounded by the cost of all the imbalances, which leads to:

$$l'_i - l_i \le s_i \sum_j \max_k \left( \left(\frac{1}{s_j} + \frac{1}{s_k}\right) \Delta r_{jk} \right) = s_i \Delta R$$

For the servers with $l_i > l'_i$, we can obtain a similar estimate by taking the set of predecessors prec and expanding the inequalities towards the servers that have no predecessors instead of moving towards those without successors.

Eq. (7) considers the loads $l_i$; the imbalance of transfer $\Delta p[i][j]$ can be similarly bounded by $\Delta p[i][j] \le \Delta r_{ij} + \frac{s_j |l'_i - l_i| + s_i |l'_j - l_j|}{s_i + s_j}$. Here $\Delta r_{ij}$ is what would be transferred in the current state, and the terms $|l'_i - l_i|$ and $|l'_j - l_j|$ take into account the transfers that might be triggered by future changes in the load; these transfers are proportional to the relative speeds $\frac{s_j}{s_i + s_j}$ and $\frac{s_i}{s_i + s_j}$.

Finally:

$$\|\rho - \rho'\|_1 = \sum_i \sum_j \Delta\rho[i][j] \le \sum_i \sum_j \left(\Delta r_{ij} + 4 s_i \Delta R\right) \le (4m+1)\,\Delta R \sum_i s_i$$



Proposition 1 gives the estimation of the error for such partial solutions that do not have negative cycles. Therefore, the algorithm that cancels negative cycles (see Appendix) should be run whenever the estimation for distance to the optimal solution is needed. Our experiments show, however, that the negative cycles are rare in practice and that pure Algorithm 2 can remove them efficiently (Section VI).
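As a rough illustration of how the final estimate $(4m+1)\,\Delta R \sum_i s_i$ could be evaluated in practice, the sketch below computes it from the server speeds and the pairwise transfers $\Delta r_{ij}$. It assumes that $\Delta R$ is the total of all $\Delta r_{ij}$; the function name and the input layout are hypothetical and not part of the original algorithm.

```python
def distance_bound(speeds, delta_r):
    """Evaluate (4m + 1) * dR * sum(s_i), the estimate of ||rho - rho'||_1.

    speeds:  list of server speeds s_i
    delta_r: matrix of pending transfers, delta_r[i][j] = Delta r_ij
    Assumption: dR is taken as the sum of all Delta r_ij.
    """
    m = len(speeds)
    total_transfer = sum(sum(row) for row in delta_r)  # dR (assumed definition)
    return (4 * m + 1) * total_transfer * sum(speeds)


if __name__ == "__main__":
    # Two servers with speeds 1 and 2 and a total pending transfer of 1.5.
    print(distance_bound([1.0, 2.0], [[0.0, 1.0], [0.5, 0.0]]))
```

Because the estimate is linear in $\Delta R$, halving the amount of load exchanged in an iteration halves the estimated distance to the optimum.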



V. Selfish organizations 

In this section we consider the case when the organizations act selfishly: the $i$-th of them tries to minimize the total processing time of its own requests, $C_i$. We are interested in a steady state in which no peer has an interest in redirecting any of its requests to different servers, i.e., the Nash equilibrium.

A. Homogeneous network 

In this section we characterize the Nash equilibrium in the case when all the servers have equal processing power ($\forall_i\; s_i = s$), and when all the connections between servers have the same communication delay ($\forall_{i,j}\; c_{ij} = c$). We consider the homogeneous model, as the modeling of a heterogeneous interconnection graph is complex. The simulation experiments (Section VI-C) show that in the case of selfish servers the average relative degradation of the system goal on heterogeneous networks is similar to, or lower than, on the homogeneous networks.

Lemma 3. For every two servers $i$ and $j$ the difference between their average loads is bounded: $|l_i - l_j| \le c \cdot s$.

Proof: (by contradiction) Assume $|l_i - l_j| > c \cdot s$. Without loss of generality, $l_i > l_j$. Recall that $r_{ij}$ is the number of redirected requests, $r_{ij} = n_i \rho_{ij}$. For each server $k$ ($k \ne i$), it is not profitable to put more of its requests to the more loaded server, so $r_{kj} \ge r_{ki}$. Now we want to find the relation between $l_i$, $l_j$, $r_{ii}$ and $r_{ij}$. In a Nash equilibrium, it is not profitable for $i$ to redirect any additional $x$ of its own requests from itself to $j$, which can be formally expressed by the inequality:

$$\frac{(l_i - x)(r_{ii} - x)}{2s} + \frac{(l_j + x)(r_{ij} + x)}{2s} + c\,(r_{ij} + x) \ge \frac{l_i r_{ii}}{2s} + \frac{l_j r_{ij}}{2s} + c\, r_{ij}$$

equivalent to:

$$r_{ij} - r_{ii} + 2x \ge l_i - l_j - 2c \cdot s.$$

Because the inequality must hold for every positive $x$, and because $l_i - l_j > c \cdot s$:

$$r_{ij} - r_{ii} > c \cdot s - 2c \cdot s = -c \cdot s$$

Now we can show the contradiction, because

$$l_j = \sum_{k=1}^{m} r_{kj} > \sum_{k=1}^{m} r_{ki} - c \cdot s = l_i - c \cdot s$$

from which it follows that

$$l_i - l_j < c \cdot s.$$
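The equivalence used in the proof of Lemma 3 can be checked by expanding the products. The derivation below is a worked sketch under the cost model of this section (a server with load $l$ and speed $s$ contributes $\frac{l}{2s}$ per request it handles, plus delay $c$ for every relayed request); it only restates the step and introduces no new assumptions beyond $x > 0$.

```latex
\begin{align*}
&\frac{(l_i - x)(r_{ii} - x)}{2s} + \frac{(l_j + x)(r_{ij} + x)}{2s} + c\,(r_{ij} + x)
  - \frac{l_i r_{ii}}{2s} - \frac{l_j r_{ij}}{2s} - c\, r_{ij} \\
&\qquad = \frac{x\,\bigl(r_{ij} - r_{ii} + l_j - l_i + 2x\bigr)}{2s} + c\,x \;\ge\; 0 \\
&\iff r_{ij} - r_{ii} + l_j - l_i + 2x + 2cs \;\ge\; 0 \qquad \text{(multiplying by } 2s/x > 0\text{)} \\
&\iff r_{ij} - r_{ii} + 2x \;\ge\; l_i - l_j - 2cs .
\end{align*}
```

Letting $x$ tend to zero gives $r_{ij} - r_{ii} \ge l_i - l_j - 2cs$, which is the form used in the rest of the proof.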



Let us denote the average load on the server as $l_{av}$, thus $l_{av} = \frac{1}{m}\sum_{i=1}^{m} l_i$. The following theorem gives the tight estimation of the price of anarchy when the servers are loaded compared to the delay ($l_{av} \gg 2cs$). (If the servers are not loaded, our estimation of the price of anarchy is dominated by the $O\left(\left(\frac{cs}{l_{av}}\right)^2\right)$ element.)

Theorem 1. The price of anarchy in the homogeneous network is: $PoA = 1 + \frac{2cs}{l_{av}} + O\left(\left(\frac{cs}{l_{av}}\right)^2\right)$.

Proof: (upper bound) We denote the load imbalance on the $i$-th server as $\Delta_i = l_i - l_{av}$. It follows that $\sum_{i=1}^{m} \Delta_i = 0$. Also, from Lemma 3 we have $|\Delta_i| \le c \cdot s$. Additionally, each request can be relayed at most once, thus the total time used for communication is bounded by $m l_{av} c$. Therefore, the total processing time in case of selfish peers, $\sum C_i(\mathrm{self})$, is bounded:

$$\sum C_i(\mathrm{self}) \le m l_{av} c + \sum_i \frac{l_i^2}{2s} = m l_{av} c + \frac{m l_{av}^2}{2s} + \sum_i \frac{\Delta_i^2}{2s} \le m l_{av} c + \frac{m l_{av}^2}{2s} + \frac{m c^2 s^2}{2s}$$

The total processing time is the smallest when the servers have equal load (each server processes exactly $l_{av}$ requests) and do not communicate, thus the optimum is bounded from below by

$$\frac{m l_{av}^2}{2s}$$

Thus, the price of anarchy is bounded by:

$$PoA \le \frac{m l_{av}^2 + 2 m l_{av} c s + m c^2 s^2}{m l_{av}^2} = 1 + \frac{2cs}{l_{av}} + \left(\frac{cs}{l_{av}}\right)^2$$



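(tightness) Consider an instance with servers having equal initial load: $\forall_i\; n_i = l_{av}$.

In the optimal solution no requests will be redirected.

When servers are selfish, the $i$-th server will redirect to the $j$-th server ($i \ne j$) $\frac{l_{av} - 2cs}{m}$ requests and will execute $2cs + \frac{l_{av} - 2cs}{m}$ of its own requests on itself. As a result: $l_i = l_{av}$.

This is a Nash equilibrium state, because it is not profitable for any server to redirect any $x$ more of its own requests to the other server, nor to execute any $x$ more requests on itself instead of some other server, as the two following inequalities hold for every positive $x$:

$$0 \le \frac{l_{av} - x}{2s}\left(2cs + \frac{l_{av} - 2cs}{m} - x\right) + \frac{l_{av} + x}{2s}\left(\frac{l_{av} - 2cs}{m} + x\right) + c\left(\frac{l_{av} - 2cs}{m} + x\right) - \frac{l_{av}}{2s}\left(2cs + \frac{l_{av} - 2cs}{m}\right) - \frac{l_{av}}{2s} \cdot \frac{l_{av} - 2cs}{m} - c\,\frac{l_{av} - 2cs}{m}$$

$$0 \le \frac{l_{av} + x}{2s}\left(2cs + \frac{l_{av} - 2cs}{m} + x\right) + \frac{l_{av} - x}{2s}\left(\frac{l_{av} - 2cs}{m} - x\right) + c\left(\frac{l_{av} - 2cs}{m} - x\right) - \frac{l_{av}}{2s}\left(2cs + \frac{l_{av} - 2cs}{m}\right) - \frac{l_{av}}{2s} \cdot \frac{l_{av} - 2cs}{m} - c\,\frac{l_{av} - 2cs}{m}$$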



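Thus, we get the lower bound on the price of anarchy:

$$PoA \ge \frac{\frac{m l_{av}^2}{2s} + m\left(l_{av} - 2cs - \frac{l_{av} - 2cs}{m}\right)c}{\frac{m l_{av}^2}{2s}} = 1 + \left(1 - \frac{1}{m}\right)\left(\frac{2cs}{l_{av}} - 4\left(\frac{cs}{l_{av}}\right)^2\right)$$

which, for large $m$, gives the lower bound $1 + \frac{2cs}{l_{av}} - 4\left(\frac{cs}{l_{av}}\right)^2$.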



Summarizing:

$$1 + \frac{2cs}{l_{av}} - 4\left(\frac{cs}{l_{av}}\right)^2 \le PoA \le 1 + \frac{2cs}{l_{av}} + \left(\frac{cs}{l_{av}}\right)^2$$

Figure 2: The convergence of the distributed algorithm for peak distribution of initial loads.





# iterations            average   max   st. dev.
m ≤ 50      uniform     1.65      3     0.49
            exp.        2.35      3     0.47
            peak        4.87      6     0.71
m = 100     uniform     2.0       2     0.0
            exp.        2.62      3     0.48
            peak        6.88      7     0.32
m = 200     uniform     2.1       3     0.33
            exp.        3.1       4     0.33
            peak        7.84      8     0.37
m = 300     uniform     2.0       2     0.0
            exp.        3.25      4     0.43
            peak        8.0       8     0.0

Table I: The number of iterations of the distributed algorithm required to obtain at most 2% relative error in the total processing time $\sum C_i$.



The price of anarchy depends on the average load on the server and on the network delay. For the more general case, in Section VI-C we present the estimations derived from simulations.
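To make the dependence on load and delay concrete, the following sketch evaluates the two bounds of Theorem 1 and the cost ratio of the equilibrium instance from the tightness argument. It assumes the homogeneous model of Section V-A and an instance in which every server relays $\frac{l_{av} - 2cs}{m}$ requests to each other server; the function names are hypothetical and chosen only for illustration.

```python
def poa_bounds(c, s, l_av):
    """Lower and upper bound of Theorem 1 for delay c, speed s, average load l_av."""
    a = c * s / l_av
    return 1 + 2 * a - 4 * a * a, 1 + 2 * a + a * a


def tight_instance_ratio(m, c, s, l_av):
    """Selfish/optimal cost ratio for the equal-load instance (assumed layout).

    Every server relays (l_av - 2cs)/m requests to each of the other m - 1
    servers, so all loads stay equal to l_av and only the delay is added.
    """
    relayed = (l_av - 2 * c * s) / m
    optimum = m * l_av ** 2 / (2 * s)            # no communication at all
    selfish = optimum + m * (m - 1) * relayed * c
    return selfish / optimum


if __name__ == "__main__":
    low, high = poa_bounds(c=20, s=1, l_av=1000)
    print(low, high, tight_instance_ratio(m=100, c=20, s=1, l_av=1000))
```

For a fixed delay, the gap between the two bounds shrinks quadratically as the average load grows, which is why the estimate is described as tight for loaded servers.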

VI. Simulation Experiments 

In this section we present the results of two groups of experiments. First, we investigate the convergence time of the distributed algorithm. Second, we assess the loss of performance in an organizationally-distributed system compared to the optimal, central solution. The loss is computed as the ratio of the total processing times.

A. Settings 

We experimented on two kinds of networks: homogeneous, with equal communication latencies ($c_{ij} = 20$); and heterogeneous, where latencies were based on measurements between PlanetLab nodes², expressed in milliseconds³.

In the initial experiments, we analyzed networks composed of 20, 30, 50, 100, 200 and 300 servers. We also performed some experiments on larger networks (500, 1000, 2000, 3000 servers). The processing speeds of the servers $s_i$ were uniformly distributed on the interval (1, 5).

We conducted the experiments for exponential and uniform distributions of the initial load over the servers. For each distribution we analyzed five cases with the average load equal to 10, 20, 50, 200 and 1000 requests (assuming that processing a single request on a single server takes 1 ms). We also analyzed the case of peak distribution, with 100,000 requests owned by a single server.

We evaluated the result based on the distance to the optimal solution, which, because of the $O(m^6)$ complexity of standard solvers (see Section III), was approximated by our distributed algorithm.



² http://iplane.cs.washington.edu/data/data.html

³ The dataset does not contain latencies for all pairs of nodes, so we had to complement the data by calculating minimal distances.



B. Convergence time of the distributed algorithm 

In the first series of experiments, we evaluated the efficiency of the distributed algorithm, measured as the number of iterations the algorithm must perform in order to decrease the difference between the total processing times in the current and the optimal request distributions to less than 2% of the average load. In a single iteration of the distributed algorithm, each server executes Algorithm 2; if there were many pairs of servers to be optimized, we ran the optimization in random order. Table I summarizes the results.

The results indicate that the number of iterations mostly depends on the size of the network and on the distribution of the initial load. The type of the network (PlanetLab vs. homogeneous) does not influence the convergence time. Larger networks and peak distribution result in higher convergence times. In all considered networks, the algorithm converged in at most 9 iterations.

Next, we decreased the required precision error from 2% to 0.1%, and ran the same experiments. The results are given in Table II. In this case, similarly, the required number of iterations was the highest for peak distribution of the initial load. In each case the algorithm converged in at most 11 iterations. Even for 300 servers the average number of iterations is below 8. Also, the standard deviations are low, which indicates that the algorithm is stable with respect to its fast convergence.

Also, we assessed whether a variation of the distributed algorithm that does not eliminate negative cycles (Appendix A) has a slower convergence time. Although required to prove the convergence (Section IV-B), eliminating the negative cycles is complex in implementation and dominates the execution time.

We compared two versions of the distributed algorithm: without negative cycle removal; and with the removal every two iterations of the algorithm. The numbers of iterations for the two versions of the algorithm were exactly the same in all 6000 experiments. These results show that the cycles which happen in practice can be efficiently removed by pure Algorithm 1. Also, the negative cycles are rare in practice.

# iterations            average   max   st. dev.
m ≤ 50      uniform     5.1       7     1.0
            exp.        5.5       7     0.9
            peak        6.4       7     0.5
m = 100     uniform     5.8       9     1.6
            exp.        6.3       9     1.5
            peak        8.0       9     0.2
m = 200     uniform     6.1       9     2.2
            exp.        7.1       10    2.0
            peak        9.9       10    0.3
m = 300     uniform     6.2       10    2.4
            exp.        7.7       11    2.0
            peak        10.0      10    0.0

Table II: The number of iterations of the distributed algorithm required to obtain at most 0.1% relative error in the total processing time $\sum C_i$.

                                          Ratio
speeds      load          network        avg.    max     st. dev.
constant    l_av ≤ 30     c_ij = 20      1.041   1.098   0.029
                          PL             1.014   1.049   0.007
            l_av = 50     c_ij = 20      1.114   1.150   0.031
                          PL             1.011   1.033   0.006
            l_av ≥ 200    c_ij = 20      1.024   1.055   0.018
                          PL             1.003   1.022   0.003
uniform     l_av ≤ 30     c_ij = 20      1.000   1.022   0.001
                          PL             1.000   1.000   0.000
            l_av = 50     c_ij = 20      1.041   1.062   0.018
                          PL             1.000   1.000   0.000
            l_av ≥ 200    c_ij = 20      1.001   1.029   0.006
                          PL             1.000   1.000   0.000

Table III: Experimental assessment of the cost of selfishness: ratios between total processing times in cases of selfish and cooperative servers (PL denotes the PlanetLab network).

Finally, we analyzed the convergence of the distributed algorithm without negative cycle elimination on larger networks (Figure 2). The previous experiments showed that the algorithm convergence is the slowest for the peak distribution of the initial load; therefore we chose this case for the analysis. The experiments used a heterogeneous network. The results indicate that even for larger networks the total processing time decreases exponentially.

C. Cost of selfishness

In the second series of experiments we measured the cost of selfishness as the ratio between total processing times in cases of selfish and cooperative servers (Table III). In each experiment, the Nash equilibrium was approximated by the following heuristic. Each server was playing its best response to the current distribution of requests. We terminated when all servers in two consecutive steps changed the distribution of their requests by less than 1%. We computed the ratio of the total processing times: the (approximated) Nash equilibrium to the optimal value.

The cost of selfishness is low. The average is below 1.06; and the maximal value is below 1.15. The estimation of the cost of selfishness is higher in case of constant processing rates. It additionally depends on the ratio between the average initial load and the network latency and on the structure of the network. The highest cost is for homogeneous networks with constant processing rates and a medium initial load, about 2 times larger than the mean communication delay. The experiments show that the cost of selfishness is independent of the size of the network and the type of distribution of initial loads.

VII. Extension: requests of different processing 
times; replication 

Up to this point, we modeled a distributed request processing system in which requests have the same size. In this section we show how our results extend to the model where the individual requests (constituting the load) have different durations and where the requests additionally have redundancy requirements. These extensions are particularly relevant for the problem of finding the replica placement in CDNs: here different data pieces have different popularities, and data redundancy is a common requirement for increasing the availability.

We introduce the following additional notation. A task is an individual request. $J_i = \{J_i(k)\}$ denotes the set of tasks of organization $i$; $p_i(k)$ is the size (processing time) of the task $J_i(k)$.

First, let us analyze a problem in which the tasks have no 
redundancy requirements, i.e. each task has to be processed 
on exactly one server. 

In order to find the optimal solution in this extended model, we start with solving the original problem (as defined in Section II) with $n_i = \sum_k p_i(k)$. In order to derive the actual distribution of the tasks, we discretize the fractions $\rho_{ij}$ as follows: $i$ should relay to $j$ such a subset $S_i(j) \subseteq J_i$ of its own tasks that the total error $\sum_j err(S_i(j))$ is minimized:

$$err(S_i(j)) = \left| \sum_{k:\, J_i(k) \in S_i(j)} p_i(k) - \rho_{ij} n_i \right|$$



The rounding problem is the multiple subset sum problem with different knapsack capacities [8]. The problem is NP-complete but has a polynomial approximation algorithm.
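The rounding step can be illustrated with a simple greedy heuristic. The sketch below is not the approximation algorithm of [8]; it is a hypothetical largest-task-first rule, assuming that the targets $\rho_{ij} n_i$ are given for every destination $j$ (including $i$ itself) and that every task is placed on exactly one server.

```python
def greedy_round(task_sizes, targets):
    """Assign every task of one organization to a single destination.

    task_sizes: sizes p_i(k) of the tasks of organization i
    targets:    dict j -> rho_ij * n_i, the load destination j should receive
    Returns a dict j -> list of task indices (the subsets S_i(j)).
    """
    remaining = dict(targets)
    subsets = {j: [] for j in targets}
    # Largest tasks first, each to the destination with the largest unmet target.
    order = sorted(range(len(task_sizes)), key=lambda k: -task_sizes[k])
    for k in order:
        j = max(remaining, key=remaining.get)
        subsets[j].append(k)
        remaining[j] -= task_sizes[k]
    return subsets


def total_error(task_sizes, targets, subsets):
    """Sum over destinations of |assigned size - target|, i.e. the sum of err(S_i(j))."""
    return sum(
        abs(sum(task_sizes[k] for k in subsets[j]) - targets[j]) for j in targets
    )


if __name__ == "__main__":
    sizes = [5, 3, 2, 2]
    targets = {0: 7, 1: 5}
    chosen = greedy_round(sizes, targets)
    print(chosen, total_error(sizes, targets, chosen))
```

The heuristic gives no approximation guarantee; it only shows the shape of the problem, where the error is measured against the continuous fractions computed for the original model.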

Now consider a problem in which each organization must execute at least $R$ copies of each task; each copy of the task should be executed at a different location (the execution of the tasks is replicated). This setting models a CDN, but also job processing, where, to increase survivability, important jobs are replicated on different partitions of a datacenter or on different datacenters.

In this extended problem we have to introduce an additional constraint on the fractions $\rho_{ij}$ for the original problem (Section II): $\forall_{i,j}\; \rho_{ij} \le \frac{1}{R}$, which guarantees that $R\rho_{ij} \le 1$. With this constraint we can interpret $R\rho_{ij}$ as the probability of placing a copy of $J_i(k)$ at $j$; here the expected number of copies of $J_i(k)$ is $\sum_j R\rho_{ij} = R$.
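One possible way to turn the probabilities $R\rho_{ij}$ into an actual placement with exactly $R$ distinct locations is systematic sampling. The sketch below is an assumption about how the placement could be drawn, not a procedure given in the paper; `place_copies` is a hypothetical name, and the function assumes that the fractions of one organization sum to 1 and that each is at most $\frac{1}{R}$.

```python
import random


def place_copies(rho_row, R, rng=random):
    """Pick R distinct servers; server j is chosen with probability R * rho_row[j].

    Assumes sum(rho_row) == 1 and every rho_row[j] <= 1 / R, so that each
    server occupies an interval of length at most 1 on the segment [0, R).
    """
    next_point = rng.random()        # sampling points: u, u + 1, ..., u + R - 1
    chosen, cumulative = [], 0.0
    for j, rho in enumerate(rho_row):
        cumulative += R * rho        # right end of the interval of server j
        if len(chosen) < R and next_point < cumulative:
            chosen.append(j)         # at most one point falls into each interval
            next_point += 1.0
    return chosen


if __name__ == "__main__":
    print(place_copies([0.25, 0.25, 0.25, 0.25], R=2, rng=random.Random(1)))
```

Because consecutive sampling points are exactly one unit apart and no interval is longer than one unit, no server is selected twice. With inexact floating-point fractions the last point can, in rare cases, fall just outside the final interval, so a production version would need to handle that rounding case.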

VIII. Related work



The congestion games [12], [21], [25], [29] define the model for analyzing the selfish behavior of users competing for commonly available resources. Similarly to our model, the cost of a particular resource is linearly proportional to the number of competitors using the resource. In contrast, our model more closely describes the cost of using a resource which depends also on the communication delay.

The assumptions in our model are similar to those in the literature on network virtualization [4]. However, in network virtualization the problems regard locating services, which is different from optimizing the quality of serving the common user requests. The complexity of the solutions depends on the number of configurations (which here is unbounded); thus the solutions cannot be applied to our model.

The continuous allocation of requests to servers in our model is analogous to the divisible load model [17] with constant-cost communication (a special case of the affine cost model [5]) and multiple sources (multiple loads to be handled, [14], [33]). The main difference is the optimization goal: makespan is usually optimized in the divisible load model; in contrast, we optimize the average processing time, which, we believe, better models situations in which the load is composed of multiple, small requests issued by various users (the difference is analogous to the $C_{\max}$ versus $\sum C_i$ debate in the classic multiprocessor job scheduling). The other difference is how the network topology is modelled. The divisible load theory typically studies datacenter-type systems, in which the network topology is known and is a limiting factor; thus the transmissions must be scheduled in a similar way to the computations.

Distributed algorithms for load balancing mostly rely on local optimization techniques (see [1], [6], [35]). One of the most popular techniques is diffusive load balancing, similar in spirit to our distributed algorithm (see [2] and the references inside for the current state of the art and [35] for the basic mechanism description). These solutions, however, disregard the geographic distribution of the servers. Our algorithm uses a different idea: the diffusive process is used for calculating the relay fractions instead of for balancing the load. As a result, our local balancing must take into account different latencies between the servers, which requires more subtle exchange mechanisms (Algorithms 1 and 2).

Our game-theoretic approach is comparable to the selfish job model [34]: the jobs independently choose the processor on which to execute. While some studies consider mixed-case equilibria (making the model continuous similarly to ours), our model considers also communication latency. The common infrastructure models tend to have a low price of anarchy (of order $\log m / \log\log m$ [34]); the low price of anarchy in our model extends these results.

Content delivery networks are one of the motivations for our model. Large companies, like Akamai, specialize in delivering the content of their customers so that the end users experience the best quality of service. Akamai's architecture is based on DNS redirections [23], [24], [31]. However, the descriptions of the algorithms optimizing replica placement and request handling are not disclosed. Still, Akamai's infrastructure is owned and controlled by a single entity (Akamai), thus they do not need to solve the game-theoretic equivalent of our model.

CoralCDN [16] is a p2p CDN consisting of users voluntarily devoting their bandwidth and storage to redistribute the content. In CoralCDN the popular content is replicated among multiple servers (which can be viewed as relaying the requests); the requests for content are relayed only between the servers with constrained pairwise RTTs (which ensures the proximity of the delivering server). Our mathematical model formalizes the intuitions behind the heuristics in CoralCDN.

[11] shows a CDN based on a DHT and heuristic algorithms to minimize the total processing time. Although each server has fixed constraints on its load/bandwidth/storage capacity, the paper does not consider the relation between server load and its performance degradation. The evaluation is based on simulation; no theoretical results are included.

The problem of mirroring in the Internet is analyzed in [13], [28]. Both papers show different approaches to choosing locations for replicas so that the average network delay between data locations and end-users is minimized. The impact of servers' congestion is not taken into consideration.

IX. Conclusions 

In this paper we present and analyze a model of a 
distributed system that minimizes the request processing 
time by distributing the requests among servers. Existing 
models assume that the processing time is dominated either 
by the network communication delay or by congestion of 
servers. In contrast, in our model, the observed latency is 
the sum of the two delays: network delay and congestion. 
Our model can be used in different kinds of problems in 
distributed systems, ranging from routing in content delivery 
networks to load balancing in a cloud of servers. 

We show that the problem of minimizing the total processing time can be stated as an optimization problem in the continuous domain. We prove that the problem is polynomially solvable; but, because of the $O(m^6)$ complexity, standard solvers are not practical. We propose a distributed algorithm that, according to our experimental evaluation, even in a network consisting of thousands of servers requires only a dozen messages sent by each server to converge to a solution at most 0.1% worse than the optimum (not counting the gossiping to exchange the information). We show how to estimate the distance between the current solution found by the algorithm and the optimal solution. The estimation is difficult in practice, as it requires solving the subproblem of finding the maximal flow of the minimal cost in a graph. However, the distributed algorithm still outperforms standard optimization techniques. Based on the experiments, we argue that in practice this part of the algorithm can be omitted, as it does not influence the algorithm efficiency.

We also analyze how the lack of coordination influences the total processing time. We give theoretical bounds for the price of anarchy for homogeneous networks and high average loads. Additionally, we assess the price of anarchy experimentally on heterogeneous networks. In both cases the price of anarchy is low ($1 + \frac{2cs}{l_{av}}$ in the theoretical analysis, and below 1.15 in the experiments).

Our results (the low price of anarchy and an efficient distributed optimization algorithm) indicate that a fully distributed query processing system can be efficient. Thus, instead of buying services from dedicated cloud providers or CDN operators, smaller organizations, such as ISPs or universities, can gather in consortia effectively serving the participants' needs.

Appendix
Removing Negative Cycles

The problem of negative cycle removal can be reduced to finding the maximal flow of the minimal cost in a graph. The problem of finding the maximal flow of the minimal cost is well studied in the literature; in particular, the auction algorithms [7] and the approximation method for finding minimum circulation [18] are examples of distributed algorithms solving the problem.

For the purpose of proving the reduction we introduce the following notation. $out(\rho', i)$ denotes the total amount of requests that in a partial solution $\rho'$ are relayed by a server $i$ to all other servers: $out(\rho', i) = \sum_{j \ne i} r_{ij}$. $in(\rho', i)$ denotes the total amount of requests that in $\rho'$ are relayed by all other servers to $i$: $in(\rho', i) = \sum_{j \ne i} r_{ji}$.

For each server $i$ we introduce two graph vertices: the front $i_f$ and the back $i_b$. There are two additional vertices: $s$ (source) and $t$ (target). The source $s$ is linked with each front node $i_f$ with an edge $(s, i_f)$ with zero cost and capacity equal to $out(\rho', i)$. Each back node $i_b$ is linked with the target $t$ with an edge $(i_b, t)$ with zero cost and capacity equal to $in(\rho', i)$.

There are also edges between front and back nodes: for each pair $(i_f, j_b)$, $i \ne j$, there is an edge with cost equal to $c_{ij}$ and infinite capacity.

The maximal flow of the minimal cost $f$ between $s$ and $t$ can be mapped to a new partial solution $\rho''$: a flow $f_{ij}$ on an edge $(i_f, j_b)$ corresponds to server $i$ relaying $f_{ij}$ of its own requests to server $j$. Observe that, as capacity$(s, i_f) = out(\rho', i)$ and capacity$(i_b, t) = in(\rho', i)$, the load of the $i$-th server in $\rho''$ is equal to its load in $\rho'$.
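The reduction can be sketched in code as follows. This is a minimal, centralized illustration and not the distributed auction or circulation algorithms cited above: it assumes integer request counts, replaces the infinite capacity by the total relayed load, and uses a plain successive-shortest-paths solver. All function names are hypothetical.

```python
from collections import deque


def min_cost_max_flow(n, edges, source, sink):
    """Successive shortest paths; edges are (u, v, capacity, cost) tuples."""
    graph = [[] for _ in range(n)]
    to, cap, cost = [], [], []
    for u, v, c, w in edges:                  # edge 2k is forward, 2k + 1 its reverse
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    while True:
        dist = [float("inf")] * n
        parent = [-1] * n
        queued = [False] * n
        dist[source] = 0
        queue = deque([source])
        while queue:                          # Bellman-Ford with a queue
            u = queue.popleft()
            queued[u] = False
            for e in graph[u]:
                if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                    dist[to[e]] = dist[u] + cost[e]
                    parent[to[e]] = e
                    if not queued[to[e]]:
                        queued[to[e]] = True
                        queue.append(to[e])
        if dist[sink] == float("inf"):
            break
        push, v = float("inf"), sink
        while v != source:                    # bottleneck on the cheapest path
            push = min(push, cap[parent[v]])
            v = to[parent[v] ^ 1]
        v = sink
        while v != source:                    # augment along the path
            cap[parent[v]] -= push
            cap[parent[v] ^ 1] += push
            v = to[parent[v] ^ 1]
    return [cap[2 * k + 1] for k in range(len(edges))]   # flow on each edge


def remove_negative_cycles(r, c):
    """r[i][j]: requests relayed from i to j (i != j); c[i][j]: delays."""
    m = len(r)
    s, t = 2 * m, 2 * m + 1                   # front i is vertex i, back i is m + i
    out = [sum(r[i][j] for j in range(m) if j != i) for i in range(m)]
    inn = [sum(r[j][i] for j in range(m) if j != i) for i in range(m)]
    big = sum(out)                            # stands in for infinite capacity
    pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
    edges = [(s, i, out[i], 0) for i in range(m)]
    edges += [(m + i, t, inn[i], 0) for i in range(m)]
    edges += [(i, m + j, big, c[i][j]) for i, j in pairs]
    flows = min_cost_max_flow(2 * m + 2, edges, s, t)
    new_r = [[0] * m for _ in range(m)]
    for (i, j), f in zip(pairs, flows[2 * m:]):
        new_r[i][j] = f
    return new_r
```

For example, with four servers where server 0 relays 3 requests to server 2 and server 1 relays 3 requests to server 3 over slow links, while the links 0 to 3 and 1 to 2 are fast, the procedure swaps the destinations and leaves every server's load unchanged.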



b_t          μ        σ
10 KB/s      0.0      0.0
20 KB/s      -0.05    0.21
50 KB/s      -0.05    0.27
0.1 MB/s     -0.08    0.33
0.2 MB/s     0.0      0.37
0.5 MB/s     0.28     0.8
2 MB/s       0.45     1.31
5 MB/s       0.18     0.8

Table IV: The relative deviation of the average RTT caused by the increase of the background load (after removal of 5% largest deviations).



Appendix 
Validation of the constant latency 

We experimentally verified how the amount of the load sent over the network influences the communication delay between the servers. We randomly selected 60 PlanetLab servers, scattered around Europe, and simulated different intensities of the background load in the following way. Each server chooses its 5 neighbors randomly. Then the servers start sending data with constant throughput to their 5 neighbors. In different experiments, we used 8 values of the throughputs: 10 KB/s, 20 KB/s, 50 KB/s, 100 KB/s, 200 KB/s, 500 KB/s, 1 MB/s, 2 MB/s. If a particular throughput was not achievable, the server was just sending data with the maximal achievable throughput. For each value of the background load we calculated the average round trip time (RTT) between the server and each of its 5 neighbors (we used the average from 300 RTT samples).

Let $rtt(s_i, s_j, b_t)$ denote the average RTT between servers $s_i$ and $s_j$ with the background load generated with throughput $b_t$. For each pair of the servers $s_i$ and $s_j$ for which we measured the RTT, and for each value of the background throughput $b_t$, we calculated the relative deviation of the average RTT caused by the increase of the background load compared to the minimal throughput 10 KB/s: $e(s_i, s_j, b_t) = \frac{rtt(s_i, s_j, b_t) - rtt(s_i, s_j, 10\,\mathrm{KB/s})}{rtt(s_i, s_j, 10\,\mathrm{KB/s})}$. For each value of the background throughput, we removed 5% of the largest deviations and then calculated the mean from deviations $e(s_i, s_j, b_t)$, averaged over all pairs of servers ($\mu$). For each value of the background throughput we additionally calculated the standard deviations ($\sigma$). These results are presented in Table IV.
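The statistics described above can be sketched as follows. The snippet assumes that "largest deviations" means largest by value and that the population standard deviation is used; neither detail is stated in the text, and the function names and the dictionary layout keyed by server pair are hypothetical.

```python
import statistics


def relative_deviations(rtt, baseline):
    """Relative RTT change per server pair.

    rtt[pair]:      average RTT under background throughput b_t
    baseline[pair]: average RTT under the minimal throughput (10 KB/s)
    """
    return [(rtt[pair] - baseline[pair]) / baseline[pair] for pair in rtt]


def trimmed_stats(deviations, drop_fraction=0.05):
    """Drop the largest deviations, then return (mean, standard deviation)."""
    keep = len(deviations) - int(len(deviations) * drop_fraction)
    kept = sorted(deviations)[:keep]
    return statistics.mean(kept), statistics.pstdev(kept)


if __name__ == "__main__":
    baseline = {("a", "b"): 40.0, ("a", "c"): 55.0, ("b", "c"): 30.0}
    loaded = {("a", "b"): 42.0, ("a", "c"): 55.0, ("b", "c"): 33.0}
    print(trimmed_stats(relative_deviations(loaded, baseline)))
```

Trimming before averaging keeps a few congested paths from dominating the mean, which matters here because the standard deviations in Table IV are large compared to the means.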



From the data we see that up to $b_t = 0.2$ MB/s, which corresponds to the case where each server accepts $5 \cdot 0.2 \cdot 8 = 8$ Mb/s of incoming data, the average RTT was not influenced by the background throughput. This is also confirmed by the statistical analysis of the data run for the RTTs (instead of for deviations). For $b_t \le 0.2$ MB/s the ANOVA test (which we ran for the whole population, without removing 5% of the highest RTTs) confirmed the lack of dependency for over 56% of the pairs of servers. For $b_t \le 0.1$ MB/s (corresponding to 4 Mb/s of incoming throughput) the ANOVA test confirmed the null hypothesis for over 70% of the pairs of servers and for $b_t \le 50$ KB/s for over 90% of the pairs. We consider that these results strongly justify the assumption of a constant latency in our model.